NurseBridge (NB) wants to create a data driven predictive model allowing rural healthcare systems to make more informed decisions regarding compensation offerings at the most granular unit of time possible.
Goal is to use this data to categorize risk of death by day of the week on a county level.
Underlying cause of death 2018-21
Two datasets were joined, one listing month and weekday and their associated deaths by county; the other listing total population and total deaths by county. Crude death rate was also included and is essentially annual deaths per population.
TX_all %>% glimpse()## Rows: 85,344
## Columns: 9
## $ state <fct> TX, TX, TX, TX, TX, TX, TX, TX, TX, TX, TX, T…
## $ county <fct> Anderson, Anderson, Anderson, Anderson, Ander…
## $ county_code <fct> 48001, 48001, 48001, 48001, 48001, 48001, 480…
## $ year <dbl> 2018, 2018, 2018, 2018, 2018, 2018, 2018, 201…
## $ month <ord> Jan, Jan, Jan, Jan, Jan, Jan, Jan, Feb, Feb, …
## $ weekday <fct> Sunday, Monday, Tuesday, Wednesday, Thursday,…
## $ daily_county_deaths <chr> "Suppressed", "14", "Suppressed", "Suppressed…
## $ annual_county_deaths <chr> "692", "692", "692", "692", "692", "692", "69…
## $ annual_county_population <dbl> 58057, 58057, 58057, 58057, 58057, 58057, 580…
Due to suppressed data, any annual death rates calculated from the
daily measure do not equal death rates from the summarised data. This is
evidenced by taking the sum of all daily_deaths per
county and comparing to reported annual_deaths
in the dataset. Note no calculated deaths are greater than reported
deaths, as suppressed data is not summed.
A potentially interesting measure emerges, that there are some
counties which never experience a day with less than 9 deaths
(annual_deaths == calc_annual_deaths)
## compare reported annual deaths to calculated annual deaths
compare <-
left_join(TX_pop_2021,
TX_all %>%
filter(year == 2021) %>%
select(county_code, daily_county_deaths) %>%
group_by(county_code) %>%
mutate(daily_county_deaths = as.numeric(daily_county_deaths)) %>%
summarise(calc_annual_deaths = sum(daily_county_deaths, na.rm = TRUE))
) %>%
select(county_code, annual_deaths, calc_annual_deaths) %>%
mutate(lessOrEqual = ifelse(calc_annual_deaths <= annual_deaths, TRUE, FALSE))
comparison <- compare %>%
arrange(-annual_deaths)
DT::datatable(comparison)How have total deaths in the state trended over the years compared to population?
TX_all %>%
select(county, year, annual_county_population) %>%
distinct() %>%
group_by(year) %>%
summarise(TX_population = sum(annual_county_population)) %>%
mutate(year = as.factor(year)) %>%
left_join(
TX_all %>%
select(county, year, annual_county_deaths) %>%
distinct() %>%
group_by(year) %>%
summarise(total_deaths = sum(as.numeric(annual_county_deaths), na.rm = TRUE)) %>%
mutate(year = as.factor(year))
) %>%
pivot_longer(
cols = TX_population:total_deaths) %>%
ggplot(aes(x=year, y=value, group=name, fill=name)) +
# geom_col(position="dodge")
geom_point()+
geom_line()+
facet_wrap(~name, scales="free_y") +
scale_y_continuous(labels=comma)Do certain months show patterns year after year?
gg_1 <-
TX_all %>%
mutate_if(is.character, as.numeric) %>%
mutate(year = as.factor(year)) %>%
group_by(year, month) %>%
summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>%
# ungroup() %>%
ggplot(aes(x=month, y=deaths, color=year, group=year)) +
geom_point() +
geom_line(alpha=.8) +
labs(
title = "Deaths over time"
) +
scale_y_continuous(labels = comma) +
xlab("Month") +
ylab("Deaths")
ggplotly(gg_1)Every year the deaths increase, what about after normalizing against population?
# get TX population by year
TX_pop <- TX_all %>%
select(county, year, annual_county_population) %>%
distinct() %>%
group_by(year) %>%
summarise(TX_population = sum(annual_county_population)) %>%
mutate(year = as.factor(year))
gg_2 <- TX_all %>%
mutate_if(is.character, as.numeric) %>%
mutate(year = as.factor(year)) %>%
group_by(year, month) %>%
summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>%
left_join(TX_pop) %>%
mutate(pct_pop_deaths = deaths/TX_population) %>%
ggplot(aes(x=month, y=pct_pop_deaths, color=year, group=year)) +
geom_point() +
geom_line(alpha=.8) +
labs(
title = "Deaths over time",
subtitle = "Normalized for population"
) +
scale_y_continuous(labels = percent) +
xlab("Month") +
ylab("Deaths (percent of state population")
ggplotly(gg_2)Are the spikes seasonal or due to covid?
## visualize covid spike ----
gg_3 <- TX_all %>%
mutate_if(is.character, as.numeric) %>%
mutate(year = as.factor(year)) %>%
group_by(year, month) %>%
summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>%
left_join(TX_pop) %>%
mutate(
pct_pop_deaths = deaths/TX_population,
date = ymd(paste0(year,"-", month, "-01"))
) %>%
ggplot(aes(x=date, y=pct_pop_deaths, color=year)) +
geom_point()+
geom_line(alpha=.8)+
labs(
title = "Deaths over time",
subtitle = "Normalized for population"
) +
scale_y_continuous(labels = percent) +
scale_x_date(date_breaks = "4 months", date_labels = "%Y-%m") +
xlab("Month") +
ylab("Deaths (percent of state population")+
geom_rect(aes(xmin=ymd("2020-01-01"), xmax=ymd("2021-12-01"), ymin=0, ymax=.001), alpha=.005, fill='orange', color=NA) +
theme(legend.position = "none")
gg_3ggplotly(gg_3)If we look at how deaths have trended over time, on a per month basis, can we see any useful patterns at all?
TX_all %>%
mutate_if(is.character, as.numeric) %>%
mutate(year = as.factor(year)) %>%
group_by(year, month) %>%
summarise(deaths = sum(daily_county_deaths, na.rm = TRUE)) %>%
left_join(TX_pop) %>%
mutate(
pct_pop_deaths = deaths/TX_population,
date = ymd(paste0(year,"-", month, "-01"))
) %>%
ggplot(aes(x=year, y=deaths))+
geom_point()+
geom_line(group=1,alpha=.4)+
facet_wrap(~month)Are certain weekdays more deadly?
TX_all %>%
mutate_if(is.character, as.numeric) %>%
select(county, year, month, weekday, daily_county_deaths) %>%
group_by(weekday) %>%
summarise(total_deaths = sum(daily_county_deaths, na.rm = TRUE)) %>%
ggplot(aes(x=weekday, y=total_deaths))+
geom_point()+
geom_line(group=1)+
scale_y_continuous(labels=comma)And does this pattern hold true year by year?
TX_all %>%
mutate_if(is.character, as.numeric) %>%
select(county, year, month, weekday, daily_county_deaths) %>%
group_by(year, weekday) %>%
summarise(total_deaths = sum(daily_county_deaths, na.rm = TRUE)) %>%
ggplot(aes(x=weekday, y=total_deaths))+
geom_point()+
geom_line(group=1)+
scale_y_continuous(labels=comma)+
facet_wrap(~year, scale="free_y", ncol=1)